0 關於教材授權

本教材之智慧財產權, 屬木刻思股份有限公司所有。

如果有朋友,覺得此教材很棒,希望能分享給朋友,或是拿此教材開課。非常歡迎大家來信至 course@agilearning.io 請求教材的使用授權唷!


1.1 ETL資料清理


1.2 重要套件 packages 與函式 functions

?readLines
?read.table
library(XLConnect)
## Loading required package: XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI, Apache Commons
##     Codec),
##   Stephen Colebourne [ctb, cph] (Joda-Time Java library)
## http://www.mirai-solutions.com ,
## http://miraisolutions.wordpress.com
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
## 
## Attaching package: 'xlsx'
## 
## The following objects are masked from 'package:XLConnect':
## 
##     createFreezePane, createSheet, createSplitPane, getCellStyle,
##     getSheets, loadWorkbook, removeSheet, saveWorkbook,
##     setCellStyle, setColumnWidth, setRowHeight
library(readxl)
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RSQLite)
## Loading required package: DBI

1.3 範例資料準備

setwd("RCourseBankSinoPac")
dir()
##  [1] "data"               "img"                "index.html"        
##  [4] "index.Rmd"          "installPackages.R"  "LICENSE"           
##  [7] "main.css"           "mops.html"          "README.md"         
## [10] "temp.Rmd"           "tryCodes"           "week2handouts.html"
## [13] "week2handouts.Rmd"

1.4 環境設定: rJava 安裝

library(rJava)


1.5 暖身

練習操作:複習R語法

小提示 Tips

  • 工作目錄 working directory

    • getwd

    • setwd

    • dir

  • 自動補齊 autocompletion

    • tab

  • 執行命令 source

    • control + enter

  • 註解增減 comment

    • shift + control + c

library(Lahman)
## View Data
head(Batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA
tail(Batting)
##        playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB
## 99841  zeidjo01   2014     1    HOU   AL  23   1  0   0   0   0  0   0  0
## 99842 zieglbr01   2014     1    ARI   NL  68   1  0   0   0   0  0   0  0
## 99843 zimmejo02   2014     1    WAS   NL  32  55  3  10   1   0  0   1  0
## 99844 zimmery01   2014     1    WAS   NL  61 214 26  60  19   1  5  38  0
## 99845 zobribe01   2014     1    TBA   AL 146 570 83 155  34   3 10  52 10
## 99846 zuninmi01   2014     1    SEA   AL 131 438 51  87  20   2 22  60  0
##       CS BB  SO IBB HBP SH SF GIDP
## 99841  0  0   1   0   0  0  0    0
## 99842  0  0   1   0   0  0  0    0
## 99843  0  2  21   0   0  9  1    0
## 99844  0 22  37   0   0  0  4    6
## 99845  5 75  84   4   1  2  6    8
## 99846  3 17 158   1  17  0  4   12
## View(Batting)
summary(Batting)
##    playerID             yearID         stint           teamID     
##  Length:99846       Min.   :1871   Min.   :1.000   CHN    : 4768  
##  Class :character   1st Qu.:1932   1st Qu.:1.000   PHI    : 4671  
##  Mode  :character   Median :1971   Median :1.000   PIT    : 4621  
##                     Mean   :1963   Mean   :1.077   SLN    : 4581  
##                     3rd Qu.:1996   3rd Qu.:1.000   CIN    : 4438  
##                     Max.   :2014   Max.   :5.000   CLE    : 4402  
##                                                    (Other):72365  
##  lgID             G                AB              R         
##  AA: 1890   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
##  AL:45631   1st Qu.: 13.00   1st Qu.:  7.0   1st Qu.:  0.00  
##  FL:  470   Median : 35.00   Median : 58.0   Median :  5.00  
##  NA:  737   Mean   : 51.46   Mean   :150.6   Mean   : 19.98  
##  NL:50639   3rd Qu.: 81.00   3rd Qu.:252.0   3rd Qu.: 30.00  
##  PL:  147   Max.   :165.00   Max.   :716.0   Max.   :192.00  
##  UA:  332                    NA's   :5149    NA's   :5149    
##        H               X2B              X3B               HR        
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 11.00   Median : 2.000   Median : 0.000   Median : 0.000  
##  Mean   : 39.43   Mean   : 6.654   Mean   : 1.385   Mean   : 2.944  
##  3rd Qu.: 64.00   3rd Qu.:10.000   3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :262.00   Max.   :67.000   Max.   :36.000   Max.   :73.000  
##  NA's   :5149     NA's   :5149     NA's   :5149     NA's   :5149    
##       RBI               SB                CS               BB        
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:  0.00  
##  Median :  4.00   Median :  0.000   Median : 0.000   Median :  3.00  
##  Mean   : 18.04   Mean   :  3.182   Mean   : 1.337   Mean   : 13.88  
##  3rd Qu.: 27.00   3rd Qu.:  2.000   3rd Qu.: 1.000   3rd Qu.: 20.00  
##  Max.   :191.00   Max.   :138.000   Max.   :42.000   Max.   :232.00  
##  NA's   :5573     NA's   :6449      NA's   :28603    NA's   :5149    
##        SO              IBB              HBP               SH        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  2.00   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 10.00   Median :  0.00   Median : 0.000   Median : 1.000  
##  Mean   : 21.57   Mean   :  1.23   Mean   : 1.114   Mean   : 2.486  
##  3rd Qu.: 30.00   3rd Qu.:  1.00   3rd Qu.: 1.000   3rd Qu.: 3.000  
##  Max.   :223.00   Max.   :120.00   Max.   :51.000   Max.   :67.000  
##  NA's   :12987    NA's   :41712    NA's   :7959     NA's   :11487   
##        SF             GIDP       
##  Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 0.00   1st Qu.: 0.000  
##  Median : 0.00   Median : 1.000  
##  Mean   : 1.16   Mean   : 3.225  
##  3rd Qu.: 2.00   3rd Qu.: 5.000  
##  Max.   :19.00   Max.   :36.000  
##  NA's   :41181   NA's   :31257
## explore data structure
str(Batting)
## 'data.frame':    99846 obs. of  22 variables:
##  $ playerID: chr  "abercda01" "addybo01" "allisar01" "allisdo01" ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
##  $ lgID    : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ G       : int  1 25 29 27 25 12 1 31 1 18 ...
##  $ AB      : int  4 118 137 133 120 49 4 157 5 86 ...
##  $ R       : int  0 30 28 28 29 9 0 66 1 13 ...
##  $ H       : int  0 32 40 44 39 11 1 63 1 13 ...
##  $ X2B     : int  0 6 4 10 11 2 0 10 1 2 ...
##  $ X3B     : int  0 0 5 2 3 1 0 9 0 1 ...
##  $ HR      : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ RBI     : int  0 13 19 27 16 5 2 34 1 11 ...
##  $ SB      : int  0 8 3 1 6 0 0 11 0 1 ...
##  $ CS      : int  0 1 1 1 2 1 0 6 0 0 ...
##  $ BB      : int  0 4 2 0 2 0 1 13 0 0 ...
##  $ SO      : int  0 0 5 2 1 1 0 1 0 0 ...
##  $ IBB     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HBP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SH      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GIDP    : int  NA NA NA NA NA NA NA NA NA NA ...
str(attributes(Batting))
## List of 3
##  $ names    : chr [1:22] "playerID" "yearID" "stint" "teamID" ...
##  $ row.names: int [1:99846] 1 2 3 4 5 6 7 8 9 10 ...
##  $ class    : chr "data.frame"
class(Batting)
## [1] "data.frame"
## data shape
dim(Batting)
## [1] 99846    22
ncol(Batting)
## [1] 22
nrow(Batting)
## [1] 99846

2.1 什麼是資料第二問:資料「應該」長怎樣

2.1.1 以第一週 Yahoo Stock 資料為例

library(httr)
library(rvest)
## Loading required package: xml2
## Connector
Target_URL = "http://tw.stock.yahoo.com/d/s/major_2330.html"
res <- GET(Target_URL)
doc_str <- content(res, type = "text", encoding = "big5")

## Parser
if (.Platform$OS.type == "windows"){
  Sys.setlocale(category='LC_ALL', locale='C')
  data_table <- doc_str %>% read_html(encoding = "big-5") %>% html_nodes(xpath = "//table[1]//table[2]") %>% html_table(header=TRUE)
  Sys.setlocale(category='LC_ALL', locale='cht')
  data_table <- apply(data_table[[1]],2,function(x) iconv(x,from = "utf8"))
  colnames(data_table) <- iconv(colnames(data_table), from = "utf8")
}  else{
  data_table <- doc_str %>% read_html(encoding = "big-5") %>% html_nodes(xpath = "//table[1]//table[2]") %>% html_table(header=TRUE)
  data_table <- data_table[[1]]
}
# View(data_table)
data_table
##      買超券商 買進 賣出 買超   賣超券商 買進 賣出  賣超
## 1  台灣東方匯 3318  184 3134   凱基證券 2088 9677 -7589
## 2    摩根大通 2277  186 2091   康和證券  116 4069 -3953
## 3    花旗環球 2571 1124 1447 港商麥格里 1272 5051 -3779
## 4  永豐金證券 2772 1350 1422   國泰證券  255 2377 -2122
## 5    宏遠證券 1296   30 1266 新加坡商瑞 1954 2838  -884
## 6  台灣摩根士 2667 1502 1165 香港上海匯   36  405  -369
## 7    美商高盛 2723 1639 1084       光隆    2    3    -1
## 8    元大證券 2292 1397  895   豐德證券    5    5     0
## 9        美林 2075 1205  870   港商聯昌    0    0     0
## 10   玉山證券  667   28  639   富隆證券    2    2     0
## 11   日盛證券  796  178  618       萬泰    0    0     0
## 12       元富  883  298  585   大鼎證券    0    0     0
## 13   富邦證券  804  303  501   福邦證券    0    0     0
## 14   群益金鼎  537  130  407   全泰證券    0    0     0
## 15   兆豐證券  575  171  404   大和國泰    0    0     0

2.1.2 想想看,這樣的資料有什麼問題?

  • 左右兩欄的表格應該整併成同一欄

  • 並且加入時間欄位、股票名稱

  • 才能塞進資料庫/檔案做儲存或進一步應用

head(data_table)
##     買超券商 買進 賣出 買超   賣超券商 買進 賣出  賣超
## 1 台灣東方匯 3318  184 3134   凱基證券 2088 9677 -7589
## 2   摩根大通 2277  186 2091   康和證券  116 4069 -3953
## 3   花旗環球 2571 1124 1447 港商麥格里 1272 5051 -3779
## 4 永豐金證券 2772 1350 1422   國泰證券  255 2377 -2122
## 5   宏遠證券 1296   30 1266 新加坡商瑞 1954 2838  -884
## 6 台灣摩根士 2667 1502 1165 香港上海匯   36  405  -369
colnames(data_table) <- c("券商", "買進", "賣出", "差距", "券商", "買進", "賣出", "差距")
stockMajor <- rbind(data_table[,1:4], data_table[,5:8])
# stockMajor <- cbind(日期 = c(Sys.Date()), 股票代號 = c("2330"), 股票名稱 = c("台積電"), stockMajor)
stockMajor <- cbind(日期 = c("2015-12-31"), 股票代號 = c("2330"), 股票名稱 = c("台積電"), stockMajor)
# View(stockMajor)
head(stockMajor)
##         日期 股票代號 股票名稱       券商 買進 賣出 差距
## 1 2015-12-31     2330   台積電 台灣東方匯 3318  184 3134
## 2 2015-12-31     2330   台積電   摩根大通 2277  186 2091
## 3 2015-12-31     2330   台積電   花旗環球 2571 1124 1447
## 4 2015-12-31     2330   台積電 永豐金證券 2772 1350 1422
## 5 2015-12-31     2330   台積電   宏遠證券 1296   30 1266
## 6 2015-12-31     2330   台積電 台灣摩根士 2667 1502 1165

練習:Yahoo Stock


2.2 一般文字資料處理


2.2.1 取得資料後的第一步,是「觀察」

  • 首先,觀察檔案的格式、欄位、資料型態

  • 利用 readLines,先把前幾筆資料讀進來觀察

res <- readLines(con = "data/cl_info_other.csv", n = 10L, encoding = "BIG-5")
iconv(res,from='utf8',to='big5')
##  [1] "etl_dt,bank_code,data_dt,bank_nm,mortgage_cnt,mortgage_bal,decorator_hse_cnt,decorator_hse_bal,ln_car_cnt,ln_car_bal,ln_worker_cnt,ln_worker_bal,other_cl_cnt,other_cl_bal"                                       
##  [2] "2013-11-26 22:30:07.971327,020       ,2006-01-01 00:00:00,\xa4\xe9\xb0\u04f7\xe7\xc1J\xb9\xea\xb7~\xbbȦ\xe6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"                                                              
##  [3] "2013-11-26 22:30:07.974241,N000      ,2006-01-01 00:00:00,\xa5x\xa5_\xbf\xa4\xb2H\xa4\xf4\xb2Ĥ@\xabH\xa5ΦX\xa7@\xaa\xc0,9924.0,16873000000.0,173.0,183000000.0,2.0,1000000.0,953.0,372000000.0,1101.0,309000000.0"
##  [4] "2013-11-26 22:30:07.979319,809       ,2006-01-01 00:00:00,\xb8U\xae\xf5\xb0\u04f7~\xbbȦ\xe6,4051.0,5624000000.0,1329.0,1471000000.0,3128.0,926000000.0,0.0,0.0,1697098.0,76111000000.0"                           
##  [5] "2013-11-26 22:30:07.995118,146       ,2006-01-01 00:00:00,\xa5x\xa4\xa4\xa5\xab\xb2ĤG\xabH\xa5ΦX\xa7@\xaa\xc0,11167.0,18949000000.0,118.0,97000000.0,11.0,0.0,0.0,0.0,1027.0,399000000.0"                         
##  [6] "2013-11-26 22:30:08.015108,115       ,2006-01-01 00:00:00,\xb0\U000b6a65\xab\xb2ĤG\xabH\xa5ΦX\xa7@\xaa\xc0,1551.0,3773000000.0,336.0,470000000.0,178.0,47000000.0,0.0,0.0,2295.0,515000000.0"                     
##  [7] "2013-11-26 22:30:08.052642,N005      ,2006-01-01 00:00:00,\xb0\xaa\xb6\xaf\xb2ĤT\xabH\xa5ΦX\xa7@\xaa\xc0,1903.0,2476000000.0,7828.0,11257000000.0,0.0,0.0,984.0,580000000.0,2866.0,946000000.0"                   
##  [8] "2013-11-26 22:30:08.229513,803       ,2006-01-01 00:00:00,\xc1p\xa8\xb9\xb0\u04f7~\xbbȦ\xe6,35924.0,64191000000.0,11682.0,21458000000.0,49817.0,15844000000.0,277.0,61000000.0,377762.0,13496000000.0"            
##  [9] "2013-11-26 22:30:08.231682,114       ,2006-01-01 00:00:00,\xb0\U000b6a65\xab\xb2Ĥ@\xabH\xa5ΦX\xa7@\xaa\xc0,5351.0,7501000000.0,1192.0,881000000.0,0.0,0.0,0.0,0.0,5281.0,2681000000.0"                            
## [10] "2013-11-26 22:30:08.056627,222       ,2006-01-01 00:00:00,\xbc\xea\xb4\U000bf932Ĥ@\xabH\xa5ΦX\xa7@\xaa\xc0,62.0,144000000.0,22.0,39000000.0,0.0,0.0,0.0,0.0,172.0,79000000.0"
head(res)
## [1] "etl_dt,bank_code,data_dt,bank_nm,mortgage_cnt,mortgage_bal,decorator_hse_cnt,decorator_hse_bal,ln_car_cnt,ln_car_bal,ln_worker_cnt,ln_worker_bal,other_cl_cnt,other_cl_bal"  
## [2] "2013-11-26 22:30:07.971327,020       ,2006-01-01 00:00:00,日商瑞穗實業銀行,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"                                                          
## [3] "2013-11-26 22:30:07.974241,N000      ,2006-01-01 00:00:00,台北縣淡水第一信用合作社,9924.0,16873000000.0,173.0,183000000.0,2.0,1000000.0,953.0,372000000.0,1101.0,309000000.0"
## [4] "2013-11-26 22:30:07.979319,809       ,2006-01-01 00:00:00,萬泰商業銀行,4051.0,5624000000.0,1329.0,1471000000.0,3128.0,926000000.0,0.0,0.0,1697098.0,76111000000.0"           
## [5] "2013-11-26 22:30:07.995118,146       ,2006-01-01 00:00:00,台中市第二信用合作社,11167.0,18949000000.0,118.0,97000000.0,11.0,0.0,0.0,0.0,1027.0,399000000.0"                   
## [6] "2013-11-26 22:30:08.015108,115       ,2006-01-01 00:00:00,基隆市第二信用合作社,1551.0,3773000000.0,336.0,470000000.0,178.0,47000000.0,0.0,0.0,2295.0,515000000.0"

  • 其他都可以忘記,至少要記得 read.table

2.2.2 利用觀察的結果「提供資訊」增進讀檔的效率

  • 利用 colClasses 指定資料中欄位的大小及資料型態,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)

# library(RCurl)
# Cl_info = read.table(sep=",", header=TRUE, stringsAsFactors=F, file=textConnection(getURL("https://raw.githubusercontent.com/suensummit/RCourseBankSinoPac/gh-pages/data/cl_info_other.csv")))

Cl_info = read.table(file = "data/cl_info_other.csv", sep = ",", stringsAsFactors = F, header = T, colClasses = c("character", "character", "character", "character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric"), encoding = "utf8")
head(Cl_info)
##                       etl_dt  bank_code             data_dt
## 1 2013-11-26 22:30:07.971327 020        2006-01-01 00:00:00
## 2 2013-11-26 22:30:07.974241 N000       2006-01-01 00:00:00
## 3 2013-11-26 22:30:07.979319 809        2006-01-01 00:00:00
## 4 2013-11-26 22:30:07.995118 146        2006-01-01 00:00:00
## 5 2013-11-26 22:30:08.015108 115        2006-01-01 00:00:00
## 6 2013-11-26 22:30:08.052642 N005       2006-01-01 00:00:00
##                    bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1         日商瑞穗實業銀行            0   0.0000e+00                 0
## 2 台北縣淡水第一信用合作社         9924   1.6873e+10               173
## 3             萬泰商業銀行         4051   5.6240e+09              1329
## 4     台中市第二信用合作社        11167   1.8949e+10               118
## 5     基隆市第二信用合作社         1551   3.7730e+09               336
## 6       高雄第三信用合作社         1903   2.4760e+09              7828
##   decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1        0.0000e+00          0   0.00e+00             0      0.00e+00
## 2        1.8300e+08          2   1.00e+06           953      3.72e+08
## 3        1.4710e+09       3128   9.26e+08             0      0.00e+00
## 4        9.7000e+07         11   0.00e+00             0      0.00e+00
## 5        4.7000e+08        178   4.70e+07             0      0.00e+00
## 6        1.1257e+10          0   0.00e+00           984      5.80e+08
##   other_cl_cnt other_cl_bal
## 1            0   0.0000e+00
## 2         1101   3.0900e+08
## 3      1697098   7.6111e+10
## 4         1027   3.9900e+08
## 5         2295   5.1500e+08
## 6         2866   9.4600e+08
str(Cl_info)
## 'data.frame':    9041 obs. of  14 variables:
##  $ etl_dt           : chr  "2013-11-26 22:30:07.971327" "2013-11-26 22:30:07.974241" "2013-11-26 22:30:07.979319" "2013-11-26 22:30:07.995118" ...
##  $ bank_code        : chr  "020       " "N000      " "809       " "146       " ...
##  $ data_dt          : chr  "2006-01-01 00:00:00" "2006-01-01 00:00:00" "2006-01-01 00:00:00" "2006-01-01 00:00:00" ...
##  $ bank_nm          : chr  "日商瑞穗實業銀行" "台北縣淡水第一信用合作社" "萬泰商業銀行" "台中市第二信用合作社" ...
##  $ mortgage_cnt     : num  0 9924 4051 11167 1551 ...
##  $ mortgage_bal     : num  0.00 1.69e+10 5.62e+09 1.89e+10 3.77e+09 ...
##  $ decorator_hse_cnt: num  0 173 1329 118 336 ...
##  $ decorator_hse_bal: num  0.00 1.83e+08 1.47e+09 9.70e+07 4.70e+08 ...
##  $ ln_car_cnt       : num  0 2 3128 11 178 ...
##  $ ln_car_bal       : num  0.00 1.00e+06 9.26e+08 0.00 4.70e+07 ...
##  $ ln_worker_cnt    : num  0 953 0 0 0 984 277 0 0 0 ...
##  $ ln_worker_bal    : num  0.00 3.72e+08 0.00 0.00 0.00 5.80e+08 6.10e+07 0.00 0.00 0.00 ...
##  $ other_cl_cnt     : num  0 1101 1697098 1027 2295 ...
##  $ other_cl_bal     : num  0.00 3.09e+08 7.61e+10 3.99e+08 5.15e+08 ...
  • 輸出形態為 Data Frame

  • file 就是指讀入的檔案路徑

  • sep 指的是欄位分割用的符號,通常 csv 檔案格式是透過,做分割

  • stringsAsFactors 預設是 True, 讓讀入的字串用 factor 儲存,那麼資料就會轉為整數儲存與對照表

  • header 預設是 False,表示第一行是不是表格標頭,輸出的 data.frame 欄位名的 colnames


2.2.3

  • 使用 write.table 寫資料

write.table(x = Cl_info, file = "data/cl_info_other_update.csv", sep = ",", quote = FALSE)

2.3 EXCEL資料處理

(a) gdata

## Need Perl on Windows
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
## 
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
## 
## Attaching package: 'gdata'
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last
## 
## The following object is masked from 'package:stats':
## 
##     nobs
## 
## The following object is masked from 'package:utils':
## 
##     object.size

(b) XLConnect and xlsx

  • 需要 rJava,另外在處理中文編碼時需要使用 iconv 輔助。

## Need Java/rJava
library(XLConnect)
library(xlsx)

(c) readxl

  • Hadley 出品,開發中,可搭配 dplyrtbl 物件)

## No dependency
library(readxl)

2.3.1 觀察

  • 首先,觀察檔案的格式、欄位、資料型態

複習一下

  • 套件::函式(public)

  • 套件:::函式(private)

  • c-1 使用 readxl::excel_sheets 取得 Excel 檔案的表單資訊(sheet)

# library(readxl)
filename <- "data/10401信用卡重要資訊揭露.xlsx"
sheetNames <- readxl::excel_sheets(filename)
sheetNames
## [1] "10401"
  • c-2 利用 readxl:::xlsx_col_types 觀察資料中欄位的大小及資料型態,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)

colTypes <- readxl:::xlsx_col_types(path = filename, nskip = 10, n = 1)
colTypes
##  [1] "text"    "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
##  [8] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
  • c-3 再用 readxl::read_excel 讀檔、xlsx::write.xlsx2 寫檔

resxl <- readxl::read_excel(filename, sheet = sheetNames[1], col_names = FALSE, skip = 8, col_types = colTypes)
colnames(resxl) <- c("金融機構名稱", "流通卡數", "有效卡數", "當月發卡數", "當月停卡數", "循環信用餘額", "未到期分期付款餘額", "當月簽帳金額", "當月預借現金金額", "逾期三個月以上帳款占應收帳款餘額含催收款之比率", "逾期六個月以上帳款占應收帳款餘額含催收款之比率", "備抵呆帳提足率", "當月轉銷呆帳金額", "當年度轉銷呆帳金額累計至資料月份")
dataLength <- sum(1 - is.na(resxl$流通卡數))
resxl <- resxl[1:dataLength,]
# colnames(resxl) <- iconv(colnames(resxl), 'utf8', 'big5')
str(resxl)
## Classes 'tbl_df' and 'data.frame':   38 obs. of  14 variables:
##  $ 金融機構名稱                                  : chr  "臺灣銀行" "臺灣土地銀行" "合作金庫商業銀行" "第一商業銀行" ...
##  $ 流通卡數                                      : num  225744 123416 400723 813420 749151 ...
##  $ 有效卡數                                      : num  115423 54717 237647 555037 517241 ...
##  $ 當月發卡數                                    : num  502 654 2606 15215 12108 ...
##  $ 當月停卡數                                    : num  1591 817 7719 8392 9691 ...
##  $ 循環信用餘額                                  : num  241782 204883 602344 1309186 665331 ...
##  $ 未到期分期付款餘額                            : num  14702 28732 193005 915441 807054 ...
##  $ 當月簽帳金額                                  : num  653624 337772 1697919 3554150 3140069 ...
##  $ 當月預借現金金額                              : num  1489 570 4136 14732 1005 ...
##  $ 逾期三個月以上帳款占應收帳款餘額含催收款之比率: num  0.326 0.3382 0.9299 0.2323 0.0819 ...
##  $ 逾期六個月以上帳款占應收帳款餘額含催收款之比率: num  0.165 0.248 0.898 0 0 ...
##  $ 備抵呆帳提足率                                : num  518 1696 251 1484 1368 ...
##  $ 當月轉銷呆帳金額                              : num  318 1895 0 5498 0 ...
##  $ 當年度轉銷呆帳金額累計至資料月份              : num  318 1895 0 5498 0 ...
# View(resxl)

2.3.2 大量同樣格式的檔案,包成 function 批次執行。

setwd("./data/")
filenames <- dir(pattern = "*.xlsx")
readCreditExcel <- function(filename) {
  sheetNames <- readxl::excel_sheets(filename)
  colTypes <- readxl:::xlsx_col_types(path = filename, nskip = 10, n = 1)
  resxl <- readxl::read_excel(filename, sheet = sheetNames[1], col_names = FALSE, skip = 8, col_types = colTypes)
  colnames(resxl) <- c("金融機構名稱", "流通卡數", "有效卡數", "當月發卡數", "當月停卡數", "循環信用餘額", "未到期分期付款餘額", "當月簽帳金額", "當月預借現金金額", "逾期三個月以上帳款占應收帳款餘額含催收款之比率", "逾期六個月以上帳款占應收帳款餘額含催收款之比率", "備抵呆帳提足率", "當月轉銷呆帳金額", "當年度轉銷呆帳金額累計至資料月份")
  dataLength <- sum(1 - is.na(resxl$流通卡數))
  resxl <- resxl[1:dataLength,]
  resxl <- cbind(月份 = substr(filename, 1, 5), resxl)
  return(resxl)
}
filenames
##  [1] "10401信用卡重要資訊揭露.xlsx" "10402信用卡重要資訊揭露.xlsx"
##  [3] "10403信用卡重要資訊揭露.xlsx" "10404信用卡重要資訊揭露.xlsx"
##  [5] "10405信用卡重要資訊揭露.xlsx" "10406信用卡重要資訊揭露.xlsx"
##  [7] "10407信用卡重要資訊揭露.xlsx" "10408信用卡重要資訊揭露.xlsx"
##  [9] "10409信用卡重要資訊揭露.xlsx" "10410信用卡重要資訊揭露.xlsx"
creditData <- lapply(filenames, readCreditExcel)
creditData <- do.call("rbind", creditData)
# View(creditData)
setwd("../")

3.1 R ETL with dplyr

簡介 dplyr

  • 官方的學習文件:vignette

library(dplyr)
vignette(all = TRUE, package = "dplyr")
vignette("introduction", package = "dplyr")
  • 讓 R 使用者可以用更有彈性的方式來處理資料

  • 針對 data.frame 做設計(名稱中的 d)

  • 設計理念

    • 導入資料整理最重要的操作(非常類似 SQL)

    • 快速與方便

    • 支援異質資料源(data.frame 或資料庫中的表格 tbl


dplyr 快速簡介

  • select 對欄位做篩選

  • filter 對資料列做篩選

  • mutate 變更或新增欄位

  • arrange 排序、排列

  • group_by + summarise 分類

  • bind 合併資料表


3.1.1 select

  • Cl_demo1 = select(資料表, 欄位1, 欄位2, 欄位3)

  • 對應的 SQL 語法:

    select data_dt, bank_nm, mortgage_bal from Cl_info ;
# Example 1: select
Cl_demo1 = select(Cl_info, data_dt, bank_nm, mortgage_bal)
head(Cl_demo1)
##               data_dt                  bank_nm mortgage_bal
## 1 2006-01-01 00:00:00         日商瑞穗實業銀行   0.0000e+00
## 2 2006-01-01 00:00:00 台北縣淡水第一信用合作社   1.6873e+10
## 3 2006-01-01 00:00:00             萬泰商業銀行   5.6240e+09
## 4 2006-01-01 00:00:00     台中市第二信用合作社   1.8949e+10
## 5 2006-01-01 00:00:00     基隆市第二信用合作社   3.7730e+09
## 6 2006-01-01 00:00:00       高雄第三信用合作社   2.4760e+09

3.1.2 filter

  • Cl_demo2 = filter(Cl_info, mortgage_bal > 1000000)

  • 對應的 SQL 語法:

    select * from Cl_info where mortgage > 1000000 ;
# Example 2: filter
Cl_demo2 = filter(Cl_info, mortgage_bal > 1000000)
head(Cl_demo2)
##                       etl_dt  bank_code             data_dt
## 1 2013-11-26 22:30:07.974241 N000       2006-01-01 00:00:00
## 2 2013-11-26 22:30:07.979319 809        2006-01-01 00:00:00
## 3 2013-11-26 22:30:07.995118 146        2006-01-01 00:00:00
## 4 2013-11-26 22:30:08.015108 115        2006-01-01 00:00:00
## 5 2013-11-26 22:30:08.052642 N005       2006-01-01 00:00:00
## 6 2013-11-26 22:30:08.229513 803        2006-01-01 00:00:00
##                    bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1 台北縣淡水第一信用合作社         9924   1.6873e+10               173
## 2             萬泰商業銀行         4051   5.6240e+09              1329
## 3     台中市第二信用合作社        11167   1.8949e+10               118
## 4     基隆市第二信用合作社         1551   3.7730e+09               336
## 5       高雄第三信用合作社         1903   2.4760e+09              7828
## 6             聯邦商業銀行        35924   6.4191e+10             11682
##   decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1        1.8300e+08          2 1.0000e+06           953      3.72e+08
## 2        1.4710e+09       3128 9.2600e+08             0      0.00e+00
## 3        9.7000e+07         11 0.0000e+00             0      0.00e+00
## 4        4.7000e+08        178 4.7000e+07             0      0.00e+00
## 5        1.1257e+10          0 0.0000e+00           984      5.80e+08
## 6        2.1458e+10      49817 1.5844e+10           277      6.10e+07
##   other_cl_cnt other_cl_bal
## 1         1101   3.0900e+08
## 2      1697098   7.6111e+10
## 3         1027   3.9900e+08
## 4         2295   5.1500e+08
## 5         2866   9.4600e+08
## 6       377762   1.3496e+10

3.1.3 mutate

  • Cl_demo3 = mutate(資料表, 新欄位名 = 運算式)

  • 對應的 SQL 語法:

    select mortgage_bal/1000000 as mortage from Cl_info ;
# Example 3: mutate
Cl_demo3 = mutate(Cl_info, mortage = mortgage_bal/1000000)
head(Cl_demo3)
##                       etl_dt  bank_code             data_dt
## 1 2013-11-26 22:30:07.971327 020        2006-01-01 00:00:00
## 2 2013-11-26 22:30:07.974241 N000       2006-01-01 00:00:00
## 3 2013-11-26 22:30:07.979319 809        2006-01-01 00:00:00
## 4 2013-11-26 22:30:07.995118 146        2006-01-01 00:00:00
## 5 2013-11-26 22:30:08.015108 115        2006-01-01 00:00:00
## 6 2013-11-26 22:30:08.052642 N005       2006-01-01 00:00:00
##                    bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1         日商瑞穗實業銀行            0   0.0000e+00                 0
## 2 台北縣淡水第一信用合作社         9924   1.6873e+10               173
## 3             萬泰商業銀行         4051   5.6240e+09              1329
## 4     台中市第二信用合作社        11167   1.8949e+10               118
## 5     基隆市第二信用合作社         1551   3.7730e+09               336
## 6       高雄第三信用合作社         1903   2.4760e+09              7828
##   decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1        0.0000e+00          0   0.00e+00             0      0.00e+00
## 2        1.8300e+08          2   1.00e+06           953      3.72e+08
## 3        1.4710e+09       3128   9.26e+08             0      0.00e+00
## 4        9.7000e+07         11   0.00e+00             0      0.00e+00
## 5        4.7000e+08        178   4.70e+07             0      0.00e+00
## 6        1.1257e+10          0   0.00e+00           984      5.80e+08
##   other_cl_cnt other_cl_bal mortage
## 1            0   0.0000e+00       0
## 2         1101   3.0900e+08   16873
## 3      1697098   7.6111e+10    5624
## 4         1027   3.9900e+08   18949
## 5         2295   5.1500e+08    3773
## 6         2866   9.4600e+08    2476

3.1.4 arrange

  • Cl_demo4 = arrange(資料表, 欄位1, desc(欄位2)))

  • 對應的 SQL 語法:

    select * from Cl_info order by mortage, data_dt desc ;
# Example 4: arrange
Cl_demo4 = arrange(Cl_info, mortgage_bal, desc(data_dt))
head(Cl_demo4)
##                       etl_dt  bank_code             data_dt
## 1 2014-04-19 07:59:29.921756 093        2014-02-01 00:00:00
## 2 2014-04-19 07:59:29.824336 040        2014-02-01 00:00:00
## 3 2014-04-19 07:59:30.146603 090        2014-02-01 00:00:00
## 4 2014-04-19 07:59:29.907231 N026       2014-02-01 00:00:00
## 5 2014-04-19 07:59:30.024388 023        2014-02-01 00:00:00
## 6 2014-04-19 07:59:30.088385 N023       2014-02-01 00:00:00
##            bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1   荷蘭商安智銀行            0            0                 0
## 2 中華開發工業銀行            0            0                 0
## 3 加拿大商豐業銀行            0            0                 0
## 4     日商瑞穗銀行            0            0                 0
## 5     泰國盤谷銀行            0            0                 0
## 6   大陸商交通銀行            0            0                 0
##   decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1                 0          0          0             0             0
## 2                 0          0          0             0             0
## 3                 0          0          0             0             0
## 4                 0          0          0             0             0
## 5                 0          0          0             0             0
## 6                 0          0          0             0             0
##   other_cl_cnt other_cl_bal
## 1            0        0e+00
## 2            8        2e+06
## 3            0        0e+00
## 4            0        0e+00
## 5            0        0e+00
## 6            0        0e+00

3.1.5 group_by + summarise

  • Cl_demo5 = summarise(group_by(資料表, 用以分組的欄位), 彙總欄位)

  • 對應的 SQL 語法:

    select mean(mortgage_cnt) from Cl_info group by bank_nm ;
# Example 5: group_by + summarise
Cl_demo5 = summarise(group_by(Cl_info, bank_nm), mean(mortgage_cnt))
head(Cl_demo5)
## Source: local data frame [6 x 2]
## 
##            bank_nm mean(mortgage_cnt)
##              (chr)              (dbl)
## 1     三信商業銀行       6.069959e+03
## 2 上海商業儲蓄銀行       3.268605e+04
## 3 中國信託商業銀行       1.427962e+05
## 4 中國國際商業銀行       3.420471e+04
## 5   中國輸出入銀行       0.000000e+00
## 6 中華開發工業銀行       1.704082e+00

3.2 範例演練

# Example:
creditDataCompete <- select(creditData, 金融機構名稱, 月份, 流通卡數, 有效卡數, 當月發卡數, 當月停卡數, 逾期三個月以上帳款占應收帳款餘額含催收款之比率) %>% mutate(使用率 = (有效卡數/流通卡數)*100, 當月增減 = 當月發卡數 - 當月停卡數) %>% arrange(金融機構名稱)

head(creditDataCompete)
##   金融機構名稱  月份 流通卡數 有效卡數 當月發卡數 當月停卡數
## 1 三信商業銀行 10401    16466    11127        494         73
## 2 三信商業銀行 10402    16556    11162        202         58
## 3 三信商業銀行 10403    16702    11241        304         72
## 4 三信商業銀行 10404    16750    11326        187         60
## 5 三信商業銀行 10405    16804    11376        223         81
## 6 三信商業銀行 10406    16797    11383        241        109
##   逾期三個月以上帳款占應收帳款餘額含催收款之比率   使用率 當月增減
## 1                                      0.3446950 67.57561      421
## 2                                      0.3056296 67.41967      144
## 3                                      0.1372363 67.30332      232
## 4                                      0.1709332 67.61791      127
## 5                                      0.2034689 67.69817      142
## 6                                      0.4154962 67.76805      132

3.3 R 與 Database 的串接:以 SQLite 為例

library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbListTables(con)
## character(0)
dbWriteTable(con, "Batting", Batting)
## [1] TRUE
dbListTables(con)
## [1] "Batting"
dbListFields(con, "Batting")
##  [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
##  [7] "AB"       "R"        "H"        "X2B"      "X3B"      "HR"      
## [13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
## [19] "HBP"      "SH"       "SF"       "GIDP"
# dbReadTable(con, "Batting")

# You can fetch all results:
res <- dbSendQuery(con, "SELECT * FROM Batting WHERE playerID = 'jeterde01'")
dbFetch(res)
##     playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1  jeterde01   1995     1    NYA   AL  15  48   5  12   4   1  0   7  0  0
## 2  jeterde01   1996     1    NYA   AL 157 582 104 183  25   6 10  78 14  7
## 3  jeterde01   1997     1    NYA   AL 159 654 116 190  31   7 10  70 23 12
## 4  jeterde01   1998     1    NYA   AL 149 626 127 203  25   8 19  84 30  6
## 5  jeterde01   1999     1    NYA   AL 158 627 134 219  37   9 24 102 19  8
## 6  jeterde01   2000     1    NYA   AL 148 593 119 201  31   4 15  73 22  4
## 7  jeterde01   2001     1    NYA   AL 150 614 110 191  35   3 21  74 27  3
## 8  jeterde01   2002     1    NYA   AL 157 644 124 191  26   0 18  75 32  3
## 9  jeterde01   2003     1    NYA   AL 119 482  87 156  25   3 10  52 11  5
## 10 jeterde01   2004     1    NYA   AL 154 643 111 188  44   1 23  78 23  4
## 11 jeterde01   2005     1    NYA   AL 159 654 122 202  25   5 19  70 14  5
## 12 jeterde01   2006     1    NYA   AL 154 623 118 214  39   3 14  97 34  5
## 13 jeterde01   2007     1    NYA   AL 156 639 102 206  39   4 12  73 15  8
## 14 jeterde01   2008     1    NYA   AL 150 596  88 179  25   3 11  69 11  5
## 15 jeterde01   2009     1    NYA   AL 153 634 107 212  27   1 18  66 30  5
## 16 jeterde01   2010     1    NYA   AL 157 663 111 179  30   3 10  67 18  5
## 17 jeterde01   2011     1    NYA   AL 131 546  84 162  24   4  6  61 16  6
## 18 jeterde01   2012     1    NYA   AL 159 683  99 216  32   0 15  58  9  4
## 19 jeterde01   2013     1    NYA   AL  17  63   8  12   1   0  1   7  0  0
## 20 jeterde01   2014     1    NYA   AL 145 581  47 149  19   1  4  50 10  2
##    BB  SO IBB HBP SH SF GIDP
## 1   3  11   0   0  0  0    0
## 2  48 102   1   9  6  9   13
## 3  74 125   0  10  8  2   14
## 4  57 119   1   5  3  3   13
## 5  91 116   5  12  3  6   12
## 6  68  99   4  12  3  3   14
## 7  56  99   3  10  5  1   13
## 8  73 114   2   7  3  3   14
## 9  43  88   2  13  3  1   10
## 10 46  99   1  14 16  2   19
## 11 77 117   3  11  7  3   15
## 12 69 102   4  12  7  4   13
## 13 56 100   3  14  3  2   21
## 14 52  85   0   9  7  4   24
## 15 72  90   4   5  4  1   18
## 16 63 106   4   9  1  3   22
## 17 46  81   0   6  4  5   10
## 18 45  90   1   5  6  1   24
## 19  8  10   1   1  0  1    3
## 20 35  87   0   6  8  4   15
dbClearResult(res)
## [1] TRUE
# Or a chunk at a time
res <- dbSendQuery(con, "SELECT * FROM Batting WHERE playerID = 'jeterde01'")
while(!dbHasCompleted(res)){
  chunk <- dbFetch(res, n = 5)
  print(nrow(chunk))
}
## [1] 5
## [1] 5
## [1] 5
## [1] 5
## [1] 0
# Clear the result
dbClearResult(res)
## [1] TRUE
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE

補充:使用 dplyr 與資料庫連接

?src_sqlite

請大家操作範例資料

下載範例資料

res <- read.csv("data/bank/bank-full.csv", header = TRUE, sep = ";")
head(res)
##   age          job marital education default balance housing loan contact
## 1  58   management married  tertiary      no    2143     yes   no unknown
## 2  44   technician  single secondary      no      29     yes   no unknown
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown
## 5  33      unknown  single   unknown      no       1      no   no unknown
## 6  35   management married  tertiary      no     231     yes   no unknown
##   day month duration campaign pdays previous poutcome  y
## 1   5   may      261        1    -1        0  unknown no
## 2   5   may      151        1    -1        0  unknown no
## 3   5   may       76        1    -1        0  unknown no
## 4   5   may       92        1    -1        0  unknown no
## 5   5   may      198        1    -1        0  unknown no
## 6   5   may      139        1    -1        0  unknown no
str(res)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

4.1 今日重點回顧:Take Home Messages

- 資料結構是重要的,三思而後行!

想想看,要排成怎樣的表格,才方便使用?

- 觀察

RData/DataFrame:head + tail + summary

txt/csv:readLines + read.table

Excel:readxl:::xlsx_col_types

- Excel檔案

gdata

XLConnect + xlsx

readxl

- dplyr <-> SQL

- 資料庫

Connect

Query

Disconnect


4.2 Q & A

5 Appendix

5.1.1 Read Excel Files (b) Solution

b-1 使用 XLConnect::loadWorkbook 取得 Excel 檔案的表單資訊(sheet)

b-2 接著利用 xlsx::read.xlsx2,先把前幾行資料讀進來觀察,預先設定欄位的大小及資料型態 colClasses,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)

b-3 再用 xlsx::read.xlsx2 讀檔(較快),xlsx::write.xlsx2 寫檔